⚠️ Tutorial will be recorded

This tutorial, including the chat, will be recorded so that you can revisit the instructions later as needed. This includes private chats. You are expected to be respectful of your classmates and tutors while actively engaging in the work.

🎯 Objectives

These exercises give you practice making plots that explore relationships between multiple variables. You will use interactive scatterplot matrices, interactive parallel coordinate plots and tours to explore the world beyond 2D.

🔧 Preparation

Install the following R packages if you do not have them already:

install.packages(c("tourr", "spinifex", "plotly", "vcd", "SMPracticals", "vcdExtra", "binostics", "RColorBrewer"))
remotes::install_github("ggobi/GGally")
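
If you only want to install what is missing, something like the following sketch may help. The helper name `missing_packages` is made up for this illustration; it only uses base R and the package names from the install command above.

```r
# Packages named in the install command above.
pkgs <- c("tourr", "spinifex", "plotly", "vcd", "SMPracticals",
          "vcdExtra", "binostics", "RColorBrewer")

# Hypothetical helper: return the packages that are not installed yet.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(pkgs)
to_install
```

If `to_install` is not empty, pass it to `install.packages()`. The GitHub install of GGally also needs the remotes package to be available.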

GRAB A COPY OF THE .Rmd FILE TO GET HELPFUL CODE FOR STARTING ON THE PROBLEMS. If you see ??? in the code, this is a place where you will need to fill in something!

Exercise 0: Introduction

In the chat window, say hello, and if you feel comfortable tell us something fun about yourself, or what you have done this last week.

Some of the following questions are based on those in Unwin (2015) Graphical Data Analysis with R, Chapters 5 and 6.

Exercise 1: Melbourne housing

  1. Read in a copy of the Melbourne housing data from Nick Tierney’s GitHub repo, which is a collation of the version at Kaggle. It’s fairly large, so let’s start simply, and choose two suburbs to focus on. I recommend “South Yarra” and “Brighton”. (Note: there are a number of missing values. I recommend removing these before making plots.)
  2. Make a scatterplot matrix of price, rooms, bedroom2, bathroom, suburb, type. The plot will be easier to read if you put the numerical variables first, and then the categorical variables. What are the associations that can be seen?
  3. Subset the data to South Yarra only. Make an interactive scatterplot matrix of rooms, bedroom2, bathroom and price, coloured by type of property. There is a really high price property. Select this case, and determine what’s special about it – why did it sell for so much? Select the outlier in bedrooms and bathrooms, and examine the other characteristics of this property.
  4. Examine price vs rooms coloured by bathrooms, faceted by suburb and type, and with a linear model overlaid. What do you learn about average house prices relative to number of rooms and number of bathrooms, for the different property types and suburbs? (Remove the one really high priced property first, because it affects what we can learn about the rest of the data.)
  5. If we throw all the neighbourhoods in together to analyse price and property characteristics, what pitfall might we encounter?
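
The data preparation for parts 1 and 2 can be sketched as follows. This is an illustration only: the small `housing` data frame holds made-up values standing in for the real file, and `plot_housing_pairs` is a hypothetical helper. The column names are the ones used in the questions above.

```r
# Toy stand-in for the housing data: values are made up for illustration only.
housing <- data.frame(
  suburb   = c("South Yarra", "Brighton", "Brighton", "Richmond", "South Yarra"),
  type     = c("u", "h", "h", "t", "h"),
  price    = c(650000, 2100000, NA, 900000, 1500000),
  rooms    = c(2, 4, 3, 3, 3),
  bedroom2 = c(2, 4, 3, 3, NA),
  bathroom = c(1, 2, 2, 2, 2)
)

# Numerical variables first, categorical variables last.
keep_vars <- c("price", "rooms", "bedroom2", "bathroom", "suburb", "type")

# Keep the two suburbs of interest and only the variables to be plotted,
# then drop every row that has a missing value in any of them.
two_suburbs <- housing[housing$suburb %in% c("South Yarra", "Brighton"), keep_vars]
two_suburbs <- na.omit(two_suburbs)

# Hypothetical helper: scatterplot matrix of the selected variables.
# GGally is only needed when the function is called.
plot_housing_pairs <- function(d) {
  GGally::ggpairs(d, columns = keep_vars)
}
```

With the real data in place of the toy data frame, `plot_housing_pairs(two_suburbs)` draws the static matrix, and wrapping the result in `plotly::ggplotly()` is one way to make it interactive.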

Exercise 2: Olive oils

Following on from the olive oils example from class, we will explore the oils from the south here.

  1. Grab a copy of the data, and subset to contain just the samples from region = south (1), and also drop eicosenoic acid, because there is nothing useful about this variable for the southern oils.
  2. Looking only at areas 1-3, that is, excluding Sicily:
    • Make an interactive parallel coordinate plot of the fatty acids (except eicosenoic), where the lines are coloured by area. (Code is provided; it is a bit tricky, but worth it!)
    • Look at the data in a tour.
    • Describe what you learn about differences between the three areas, whether these are separated. Are some variables more useful for distinguishing the three areas? Are there any outliers?
  3. Re-do part 2, this time including Sicily. Explain what you learn about Sicily relative to the other areas.
  4. Do some googling. What can you find out about Sicilian olive oils? Are they higher in value? Does Sicily even grow olives, or does it use olives from neighbouring areas?
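
A minimal sketch of the subsetting and plotting steps is given below. It is not the code provided in the .Rmd file. It assumes the data has integer `region` and `area` columns coded as in the exercise (region 1 = south, area 4 = Sicily) and a column named `eicosenoic`, as in the `olive` data shipped with tourr. The three function names and the toy data are made up for illustration.

```r
# Keep southern oils from the chosen areas and drop eicosenoic acid.
subset_south <- function(olive, areas = 1:3) {
  keep  <- olive$region == 1 & olive$area %in% areas
  south <- olive[keep, setdiff(names(olive), "eicosenoic")]
  south$area <- factor(south$area)
  south
}

# Parallel coordinate plot of the fatty acids, lines coloured by area.
plot_parcoord <- function(south) {
  acids <- setdiff(names(south), c("region", "area"))
  GGally::ggparcoord(south, columns = match(acids, names(south)),
                     groupColumn = "area", scale = "uniminmax")
}

# Grand tour of the standardised fatty acids, points coloured by area.
run_tour <- function(south) {
  acids <- setdiff(names(south), c("region", "area"))
  tourr::animate_xy(scale(south[, acids]), tour_path = tourr::grand_tour(),
                    col = south$area)
}

# Toy data (made-up values) to show what the subsetting step keeps.
toy_olive <- data.frame(region = c(1, 1, 1, 2, 3), area = c(1, 3, 4, 5, 7),
                        palmitic = 1:5, oleic = 6:10, eicosenoic = 5:1)
subset_south(toy_olive)
```

With the real data, `plotly::ggplotly(plot_parcoord(subset_south(olive)))` gives an interactive plot, and `subset_south(olive, areas = 1:4)` brings Sicily back in for part 3.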

Exercise 3: Baker field soils

  1. Make density plots of the soil variables in the Baker field corn yield data. Choose an appropriate transformation to symmetrise the distribution.
  2. Make a scatterplot matrix. If you can make an interactive one, that would be extra special. Describe the relationships between pairs of variables.
  3. Make a grand tour of the soil variables. Describe the different patterns that you see in various projections. Is there clustering? Is there linear dependence? Non-linear dependence? Are there outliers? For any structure that you see, determine which variables contribute to it, and make plots of these variables (or check the scatterplot matrix) to check whether the pattern is visible there too.
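
One way to choose a symmetrising transformation for part 1 is to compare the skewness and the density plot of a few candidates. The sketch below uses a simulated right-skewed variable, not the soil data, and a hand-written `skewness` helper, so treat it as an illustration of the idea only.

```r
set.seed(2024)
x <- rlnorm(500, meanlog = 0, sdlog = 1)  # simulated right-skewed variable

# Sample skewness: mean of the cubed standardised values (near 0 if symmetric).
skewness <- function(v) {
  z <- (v - mean(v)) / sd(v)
  mean(z^3)
}

# Candidate transformations to compare.
candidates <- list(raw = x, sqrt = sqrt(x), log = log(x))
skew <- sapply(candidates, skewness)
skew

# Density plot of each candidate, side by side.
op <- par(mfrow = c(1, 3))
for (nm in names(candidates)) {
  plot(density(candidates[[nm]]), main = nm)
}
par(op)
```

The candidate with skewness closest to zero, and the most symmetric density, is a reasonable choice to carry into the scatterplot matrix and the tour.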

Exercise 4: Exam marks

There is a dataset “mathmarks” in the SMPracticals package, which has marks out of 100 for 88 students. It is interesting to note that all students had marks for all tests, which makes one wonder whether marks for students who missed a test were dropped. Mechanics and vectors were closed book exams, and the others were open book.

  1. Make a side-by-side boxplot of the test scores. What do you learn about the test scores on the different subjects?
  2. Make a scatterplot matrix, even better if it is interactive. Describe the relationships between the tests. Is there something different about the open book vs closed book scores?
  3. Make an interactive parallel coordinate plot. Are there some students who have done consistently well on all tests? Consistently badly on all tests? Badly on some but better on others?
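
The reshaping needed for the side-by-side boxplot, and one simple way to look for consistently strong or weak students, can be sketched in base R. The `marks` data frame below is a toy stand-in with made-up values, and its column names are an assumption about how the tests are labelled in the real data.

```r
# Toy stand-in: one row per student, one column per test. Values are made up.
marks <- data.frame(
  mechanics  = c(77, 63, 30, 44, 55),
  vectors    = c(82, 78, 45, 50, 60),
  algebra    = c(67, 80, 50, 55, 62),
  analysis   = c(67, 70, 40, 49, 58),
  statistics = c(81, 81, 35, 40, 50)
)

# Reshape to long form: one row per student-test combination.
long <- stack(marks)
names(long) <- c("mark", "test")

# Side-by-side boxplots, one box per test.
boxplot(mark ~ test, data = long, ylab = "Mark (out of 100)")

# Rank students within each test, then measure how much each student's
# rank varies across tests. A small spread means a consistent student.
ranks <- sapply(marks, rank)
rank_spread <- apply(ranks, 1, function(r) max(r) - min(r))
rank_spread
```

A student with a small `rank_spread` and high ranks is consistently good, while a large spread points to someone who did badly on some tests but better on others. The parallel coordinate plot shows the same thing visually.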

Exercise 5: Knowledge and resources

The “vcdExtra” package contains a dataset “Dyke” about how 1729 survey respondents’ knowledge of cancer depended on whether they listened to the radio, read newspapers, did solid reading, or attended lectures.

  1. Make separate bar charts for each of the explanatory variables, with bars filled by the response variable Knowledge. What do you learn?
  2. Make a 100% bar chart of Newspaper, with Knowledge mapped to fill, and faceted by Reading. What do you learn about the relative proportions in the groups?
  3. Make a doubledecker plot of the data. What combination of factors leads to the highest level of knowledge about cancer? What combination leads to the lowest?
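
The heights in a 100% bar chart are conditional proportions, which can be computed directly from a frequency table. The sketch below uses a toy frequency data frame with made-up counts and only two explanatory variables. It assumes the real data is in the same frequency form with a count column called `Freq`; `plot_fill_bars` is a hypothetical helper, and ggplot2 is assumed to be installed.

```r
# Toy frequency data: one row per combination of factors. Counts are made up.
toy <- expand.grid(
  Knowledge = c("Good", "Poor"),
  Reading   = c("No", "Yes"),
  Newspaper = c("No", "Yes")
)
toy$Freq <- c(10, 40, 20, 30, 30, 30, 45, 15)

# Proportion of each Knowledge level within every Newspaper x Reading group.
tab  <- xtabs(Freq ~ Knowledge + Newspaper + Reading, data = toy)
prop <- prop.table(tab, margin = c(2, 3))
round(prop["Good", , ], 2)

# Hypothetical helper: 100% bar chart of Newspaper, filled by Knowledge,
# faceted by Reading. Counts enter through the weight aesthetic.
plot_fill_bars <- function(d) {
  ggplot2::ggplot(d, ggplot2::aes(x = Newspaper, weight = Freq, fill = Knowledge)) +
    ggplot2::geom_bar(position = "fill") +
    ggplot2::facet_wrap(~ Reading)
}
```

Calling `plot_fill_bars()` on the real data draws the chart for part 2. For part 3, a table built with `xtabs()` can be passed to `vcd::doubledecker()` with a formula that has Knowledge on the left-hand side.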

Exercise 6: Parkinsons

This dataset is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson’s disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals (“name” column). The main aim of the data is to discriminate healthy people from those with PD, according to the “status” column, which is set to 0 for healthy and 1 for PD.

The data is available at the UCI Machine Learning Repository in ASCII CSV format. Each row of the CSV file contains an instance corresponding to one voice recording. There are around six recordings per patient, and the name of the patient is given in the first column. There are 24 variables in the file, including the person’s name in column 1.

The data were originally analysed in: Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2008), ‘Suitability of dysphonia measurements for telemonitoring of Parkinson’s disease’, IEEE Transactions on Biomedical Engineering (to appear).

  1. Compute the scagnostics for all pairs of variables, except for name.
  2. Sort the scagnostics, show the top 10 on (i) Monotonic (ii) Clumpy (iii) Your choice, and plot the pair of variables with the highest values.
  3. Make an interactive scatterplot matrix. Browse over it to choose other interesting pairs of variables and make the plots.
  4. The scagnostics help us to find interesting associations between pairs of variables. However, the problem here is to detect differences between Parkinson’s patients and healthy people. How would you go about that? Think about some ideas along the lines of scagnostics, but look for differences between the two groups.

##                                  Monotonic
## Shimmer:APQ3 vs Shimmer:DDA      0.9999998
## MDVP:RAP vs Jitter:DDP           0.9999993
## MDVP:Shimmer vs MDVP:Shimmer(dB) 0.9823538
## MDVP:Shimmer vs Shimmer:DDA      0.9782364
## MDVP:Shimmer vs Shimmer:APQ3     0.9781855
## spread1 vs PPE                   0.9650254

## # A tibble: 6 x 2
##   var               d
##   <chr>         <dbl>
## 1 MDVP:Fo(Hz)   0.870
## 2 HNR           0.647
## 3 MDVP:Fhi(Hz)  0.518
## 4 MDVP:Flo(Hz)  0.211
## 5 NHR          -0.243
## 6 MDVP:RAP     -0.356
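
Two of the ideas above can be sketched in base R without any scagnostics package. The Monotonic scagnostic is based on the squared Spearman correlation, so ranking pairs by that quantity is a rough stand-in for part 2 (packaged scagnostics also bin the data, so values will differ). For part 4, one possible group comparison is a standardised difference in means for each variable. The data below is simulated, both function names are made up, and nothing here is claimed to reproduce the output shown above.

```r
# Simulated stand-in for the voice data (made-up numbers, two groups).
set.seed(1)
n <- 60
status <- rep(c(0, 1), each = n / 2)
x1 <- rnorm(n, mean = ifelse(status == 0, 2, 0))  # differs between groups
x2 <- 3 * x1 + rnorm(n, sd = 0.1)                 # nearly monotone in x1
x3 <- rnorm(n)                                    # unrelated noise
voice <- data.frame(x1, x2, x3, status)

# Squared Spearman correlation for every pair of columns, sorted high to low.
top_monotonic <- function(d, n_top = 10) {
  pairs <- combn(names(d), 2)
  score <- apply(pairs, 2, function(p) {
    cor(d[[p[1]]], d[[p[2]]], method = "spearman")^2
  })
  out <- data.frame(pair = paste(pairs[1, ], "vs", pairs[2, ]),
                    monotonic = score)
  head(out[order(-out$monotonic), ], n_top)
}

# Standardised mean difference (group 0 minus group 1) for each variable.
group_difference <- function(d, group) {
  vars <- setdiff(names(d), group)
  g <- d[[group]]
  diffs <- sapply(vars, function(v) {
    x <- d[[v]]
    (mean(x[g == 0]) - mean(x[g == 1])) / sd(x)
  })
  sort(diffs, decreasing = TRUE)
}

top_monotonic(voice[, c("x1", "x2", "x3")])
group_difference(voice, "status")
```

Variables at either end of the sorted differences are the ones whose averages differ most between the groups. The same approach extends to other summaries, for example computing a scagnostic separately within each group and comparing the two values.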